Channel-wise attention


[…]tion at the same layer (R1, R2, R3), and showed better performance than baselines and SE-Nets on image classification […]

Neural Information Processing Systems

We thank the reviewers for their comments. All reviewers think the paper is clearly written and easy to read. We address the reviewers' concerns below. […] We will include these statistics in the paper. All of this suggests that the improvement is not simply due to the increased model size.


Controllable diffusion-based generation for multi-channel biological data

Zhang, Haoran, Zhou, Mingyuan, Tansey, Wesley

arXiv.org Artificial Intelligence

Spatial profiling technologies in biology, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate high-dimensional, multi-channel data with strong spatial alignment and complex inter-channel relationships. Generative modeling of such data requires jointly capturing intra- and inter-channel structure, while also generalizing across arbitrary combinations of observed and missing channels for practical application. Existing diffusion-based models generally assume low-dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that break spatial correspondence and ignore inter-channel dependencies. This work proposes a unified diffusion framework for controllable generation over structured and spatial biological data. Our model contains two key innovations: (1) a hierarchical feature injection mechanism that enables multi-resolution conditioning on spatially aligned channels, and (2) a combination of latent-space and output-space channel-wise attention to capture inter-channel relationships. To support flexible conditioning and generalization to arbitrary subsets of observed channels, we train the model using a random masking strategy, enabling it to reconstruct missing channels from any combination of inputs. We demonstrate state-of-the-art performance across both spatial and non-spatial prediction tasks, including protein imputation in IMC and gene-to-protein prediction in single-cell datasets, and show strong generalization to unseen conditional configurations.
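
A minimal PyTorch sketch of the random masking strategy described above, assuming channels-first image tensors; the keep probability and function name are illustrative, not taken from the paper:

```python
import torch

def random_channel_mask(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample a random subset of channels to condition on.

    x: (batch, channels, height, width) multi-channel image.
    Returns the masked conditioning tensor and the binary mask, so a
    model trained on (masked_x, mask) pairs learns to reconstruct the
    hidden channels from any combination of observed ones.
    """
    b, c, _, _ = x.shape
    # Keep each channel independently with probability 0.5 (assumed);
    # re-sample any all-zero row so at least one channel is observed.
    mask = torch.bernoulli(torch.full((b, c), 0.5))
    empty = mask.sum(dim=1) == 0
    while empty.any():
        mask[empty] = torch.bernoulli(torch.full((int(empty.sum()), c), 0.5))
        empty = mask.sum(dim=1) == 0
    mask = mask.view(b, c, 1, 1)
    return x * mask, mask
```

At sampling time, the same interface covers arbitrary conditional configurations: set the mask to exactly the channels that were measured and let the model generate the rest.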


Reviews: Controllable Text-to-Image Generation

Neural Information Processing Systems

The paper is well organized and clearly written, and is easy to follow. In particular, instead of generating a new image from the text, the authors focus on image manipulation driven by a modified natural-language description. On the word-level spatial and channel-wise attention-driven generator: (1) The novelty and effectiveness of the attentional generator may be limited. Specifically, the paper designs a word-level spatial and channel-wise attention-driven generator with two attention parts (i.e., spatial attention and channel-wise attention). However, since the spatial attention is based on the method in AttnGAN [7], most of the contribution may lie in the additional channel-wise part.
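
For readers unfamiliar with the channel-wise part under discussion, the intuition is that text selects which feature channels (visual attributes) to amplify, while spatial attention selects where to look. A minimal, hedged PyTorch sketch of a text-conditioned channel gate; note it pools words into a single sentence context for brevity, whereas the reviewed paper operates at the word level, and all names here are illustrative:

```python
import torch
import torch.nn as nn

class WordChannelGate(nn.Module):
    """Gate image channels with text-derived weights.

    A sketch of the channel-wise attention idea in a text-to-image
    generator: the text decides *which* feature channels to amplify,
    complementing spatial attention, which decides *where* to attend.
    """
    def __init__(self, word_dim: int, channels: int):
        super().__init__()
        self.proj = nn.Linear(word_dim, channels)

    def forward(self, feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; words: (B, T, D) embeddings
        ctx = words.mean(dim=1)                      # (B, D) pooled text context
        gate = torch.sigmoid(self.proj(ctx))         # (B, C) per-channel weights
        return feat * gate.view(*gate.shape, 1, 1)   # reweight channels
```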


Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Sang, Mufan, Zhao, Yong, Liu, Gang, Hansen, John H. L., Wu, Jian

arXiv.org Artificial Intelligence

Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model global interactions between token embeddings, it is inadequate for capturing short-range local context, which is essential for accurate extraction of speaker information. In this study, we enhance the Transformer with locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% EER on the VoxCeleb1 test set, outperforming previously proposed Transformer-based models and CNN-based models such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve promising results, with a 14.6% relative reduction in EER over the Res2Net50 model.
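
A rough PyTorch sketch of the locality-enhancement recipe named in the abstract, i.e., depth-wise convolution followed by channel-wise (squeeze-and-excitation style) attention over frame-level features. The module layout, kernel size, and reduction ratio are assumptions for illustration, not the paper's exact LE-Conformer block:

```python
import torch
import torch.nn as nn

class LocalityModule(nn.Module):
    """Depth-wise convolution + channel-wise attention over frames.

    The depth-wise convolution supplies the short-range context that
    pure self-attention misses; the squeeze-and-excitation gate then
    reweights channels using a global (per-utterance) descriptor.
    """
    def __init__(self, channels: int, kernel_size: int = 15, reduction: int = 8):
        super().__init__()
        self.dwconv = nn.Conv1d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),                      # squeeze over time
            nn.Conv1d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # per-channel gate
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) frame-level features
        y = self.dwconv(x)           # short-range local context per channel
        return y * self.se(y) + x    # channel reweighting + residual
```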


PDANet: Polarity-consistent Deep Attention Network for Fine-grained Visual Emotion Regression

Zhao, Sicheng, Jia, Zizhou, Chen, Hui, Li, Leida, Ding, Guiguang, Keutzer, Kurt

arXiv.org Artificial Intelligence

Existing methods for visual emotion analysis mainly focus on coarse-grained emotion classification, i.e., assigning an image a dominant discrete emotion category. However, these methods cannot fully reflect the complexity and subtlety of emotions. In this paper, we study the fine-grained regression problem of visual emotions based on convolutional neural networks (CNNs). Specifically, we develop the Polarity-consistent Deep Attention Network (PDANet), a novel network architecture that integrates attention into a CNN with an emotion-polarity constraint. First, we propose to incorporate both spatial and channel-wise attention into a CNN for visual emotion regression, jointly considering the local spatial connectivity patterns along each channel and the interdependencies between different channels. Second, we design a novel regression loss, the polarity-consistent regression (PCR) loss, which uses weakly supervised emotion polarity to guide the attention generation. By optimizing the PCR loss, PDANet generates a polarity-preserving attention map and thus improves emotion regression performance. Extensive experiments are conducted on the IAPS, NAPS, and EMOTIC datasets, and the results demonstrate that the proposed PDANet outperforms state-of-the-art approaches by a large margin for fine-grained visual emotion regression. Our source code is released at: https://github.com/ZizhouJia/PDANet.
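
The PCR loss is the most transferable idea here: keep a standard regression loss but add a penalty whenever the prediction lands on the wrong side of the neutral point. A minimal sketch of that idea in PyTorch; the neutral value of 5.0 (the midpoint of the typical 1-9 valence scale used by IAPS) and the exact penalty form are assumptions, so see the paper for the actual loss:

```python
import torch

def pcr_loss(pred: torch.Tensor, target: torch.Tensor,
             neutral: float = 5.0, weight: float = 1.0) -> torch.Tensor:
    """Polarity-consistent regression loss, sketched from the abstract.

    Adds a penalty when the predicted valence falls on the opposite
    side of the neutral point from the ground truth, on top of MSE.
    """
    mse = torch.mean((pred - target) ** 2)
    # Positive only when pred and target lie on opposite sides of neutral,
    # growing with how far the prediction strays across the boundary.
    polarity = torch.relu(-(pred - neutral) * (target - neutral))
    return mse + weight * polarity.mean()
```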